<div class='content'>
<p>Contents: Next-generation sequencing basics | Research groups | Molecular biology | Data sources | Data analysis tools | Gene classification</p>

	<h2>Next-generation sequencing basics</h2>
	<p><a href="http://www.tudou.com/programs/view/NUYVq0eMRas/" class="external text" target="_blank">陈巍学基因</a> video series.</p>
	
		
	
	
	
	<h2>Research groups</h2>
	<p>Working on alternative cleavage and polyadenylation (APA): <a href="http://exon.umdnj.edu/lab_home/index.html" target="_blank">Bin Tian lab</a></p>
	<p><a href="http://blog.sina.com.cn/s/articlelist_1728333092_0_1.html" target="_blank">生信博客 (bioinformatics blog)</a>, now moved to http://qinqianshan.com/</p>
	<p><a href="http://liulab.dfci.harvard.edu/" target="_blank">X. Shirley Liu Lab</a> - the group of X. Shirley Liu (刘小乐), Harvard University</p>

	<p><a href="http://www.bio-info-trainee.com/" target="_blank">生信菜鸟团 blog</a>, <a href="http://www.biotrainee.com/" target="_blank">生信技能树 forum</a>, WeChat (biotrainee). 生信菜鸟团 covers: reviews of bioinformatics learning materials; shared resources on common data formats and public databases; common analysis software and pipelines; news on genetic testing and cancer.</p>
	<p><a href="http://guangchuangyu.github.io/" target="_blank">Guangchuang Yu (余光创)</a>, known as "Y叔", was already famous internationally before I entered bioinformatics (what impressed me most was the world map on his blog, packed with national flags!). See his <a href="https://github.com/GuangchuangYu" target="_blank">github</a> and his <a href="https://guangchuangyu.github.io/cn/" target="_blank">personal blog</a>.</p>
<pre>
12 visualization packages: http://guangchuangyu.github.io/
CV: https://guangchuangyu.github.io/resume/
Douban reading list: http://guangchuangyu.github.io/cn/douban/
Follow Y叔's WeChat official account: biobabble

Programming problem collection: http://www.codeabbey.com/index/task_list
Collection of bioinformatics WeChat accounts: http://www.360doc.com/content/17/0305/22/19913717_634279918.shtml 
Plotting protein-protein interactions (PPI): http://mp.weixin.qq.com/s?__biz=MzI4NzExMjU0Mw==&mid=2651105971&idx=1&sn=63a4c92b7c65e2e26d6e6982e0309a47&scene=21#wechat_redirect 

</pre>
	
	

	<h2>Molecular biology</h2>
	<h3>Structure of a protein-coding gene</h3>
	<img src='data/NGS/images/gene_structure.png'>
	
	<img src='data/NGS/images/gene_structure2.jpg'>
	<p>INR: Initiator Region.</p>
	<p>TSS: transcriptional start site. (or <a target="_blank" href="https://www.researchgate.net/post/Does_anyone_have_a_clear_illustration_of_a_gene_Does_TSS_first_exon">first exon</a>)</p>
	
	<p>Within a typical gene, the elements are arranged in this order: transcription start site (TSS, a single base), start codon (ATG), stop codon (e.g. TGA), transcription termination site (TTS). In short: TSS-ATG-TGA-TTS.</p>
	
	
	<h3>TPM / RPKM</h3>
	<img src='data/NGS/images/TPM.png'>
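	<p>A minimal sketch of the two measures named in the heading, assuming the standard definitions: RPKM divides a gene's read count by its length in kilobases and by the total mapped reads in millions, while TPM first divides by length and then rescales so that all values in a sample sum to one million. The function names and the toy numbers are illustrative assumptions, and the total mapped reads is approximated here by the sum of the counts passed in.</p>

```python
def tpm(counts, lengths_bp):
    """Transcripts per million for one sample.

    counts: raw read counts per gene; lengths_bp: gene lengths in bases.
    """
    if len(counts) != len(lengths_bp):
        raise ValueError("counts and lengths_bp must have the same length")
    # Step 1: reads per kilobase (length normalisation).
    rpk = [c / (length / 1000) for c, length in zip(counts, lengths_bp)]
    total = sum(rpk)
    if total == 0:
        return [0.0] * len(rpk)
    # Step 2: rescale so the sample sums to one million.
    return [r / total * 1e6 for r in rpk]


def rpkm(counts, lengths_bp):
    """Reads per kilobase per million mapped reads for one sample.

    Assumption: total mapped reads is taken as sum(counts).
    """
    if len(counts) != len(lengths_bp):
        raise ValueError("counts and lengths_bp must have the same length")
    total_reads = sum(counts)
    if total_reads == 0:
        return [0.0] * len(counts)
    return [c * 1e9 / (total_reads * length)
            for c, length in zip(counts, lengths_bp)]


if __name__ == "__main__":
    toy_counts = [10, 20, 30]          # made-up counts
    toy_lengths = [1000, 2000, 3000]   # made-up lengths in bases
    print(tpm(toy_counts, toy_lengths))
    print(rpkm(toy_counts, toy_lengths))
```

	<p>Because TPM values always sum to one million within a sample, they are easier to compare between samples than RPKM values, whose sum varies from sample to sample.</p>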
	
		
	<h3>APA, polyadenylation site, TF</h3>
	<img src='data/NGS/images/polyadenylation_site.jpg'>
	
	
	
	
	
	<h2>Data sources</h2>
	<p>Reference genomes: 
<a href="http://support.illumina.com/sequencing/sequencing_software/igenome.html" target="_blank">iGenomes</a> (downloads are slow and difficult)</p>
	<ol>
		<li><a href="ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/Homo_sapiens/UCSC/hg38/Homo_sapiens_UCSC_hg38.tar.gz" target="_blank">Homo_sapiens_UCSC_hg38</a> [16006984068 bytes = 14.907 GiB]</li>
		<li><a href="ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/Mus_musculus/NCBI/GRCm38/Mus_musculus_NCBI_GRCm38.tar.gz" target="_blank">Mus_musculus_NCBI_GRCm38</a> [29516673107 bytes = 27.48 GiB]</li>
		<li><a href="ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/Homo_sapiens/UCSC/hg19/Homo_sapiens_UCSC_hg19.tar.gz" target="_blank">Homo_sapiens_UCSC_hg19</a> [45468620403 bytes = 42.34 GiB]</li>
	</ol>



	<p><a href="http://cancergenome.nih.gov/" target="_blank">TCGA</a>: The Cancer Genome Atlas</p>
	
	<p>Gene Expression Omnibus, <a href="https://www.ncbi.nlm.nih.gov/geo/" target="_blank">GEO</a> (<a href="https://www.ncbi.nlm.nih.gov/geo/browse/" target="_blank">latest GEO data</a>): GEO is a public functional genomics data repository supporting MIAME-compliant data submissions. Array- and sequence-based data are accepted. Tools are provided to help users query and download experiments and curated gene expression profiles.</p>
	
	<p>Genome Browser Gateway:<a href="https://genome.ucsc.edu/cgi-bin/hgGateway?db=hg38" target="_blank">UCSC</a>:The GRCh38 assembly is the first major revision of the human genome released in more than four years. As with the previous GRCh37 assembly, the Genome Reference Consortium (GRC) is now the primary source for human genome assembly data submitted to GenBank. Beginning with this release, the UCSC Genome Browser version numbers for the human assemblies now match those of the GRC to minimize version confusion. Hence, the GRCh38 assembly is referred to as "hg38" in the Genome Browser datasets and documentation. For a glossary of assembly-related terms, see the GRC Assembly Terminology page.</p>
	<p><a href="http://dnase.genome.duke.edu/" target="_blank">Regulatory Elements Database</a>: This database provides a user interface to the results of Sheffield et al. (2013). The data come from hundreds of DNaseI-hypersensitivity and Affymetrix microarray experiments.</p>
	
	<p><a href="http://gdac.broadinstitute.org/" target="_blank">GDAC</a> at the Broad Institute</p>

	
	
	
	
	
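	<h2>Data analysis tools</h2>
	<p><a href="http://bowtie-bio.sourceforge.net/bowtie2/index.shtml" class="external text" target="_blank">Bowtie 2 documentation</a>. A short-read aligner. It is not splice-aware on its own, so for RNA-seq it is typically used through a spliced aligner such as TopHat2.</p>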
	<p><a href="http://xena.ucsc.edu/getting-started/" class="external text" target="_blank">UCSC Xena</a> Securely analyze and visualize your private functional genomics data set in the context of public and shared genomic/phenotypic data sets.</p>
	
	
	
	
	
	<h2>Gene classification</h2>
<div id="static" style="border:1px dashed red;">
<style>
#static ul{border:0;}
#static h2, #static h3{background:#fff; border:0;}
#static p{text-indent:0;}
</style>
	<p><a target=_blank href="https://www.gencodegenes.org/pages/biotypes.html">Gene/Transcript Biotypes in GENCODE &amp; Ensembl</a></p>
	<p>Please also compare to the <a target=_blank href="http://vega.sanger.ac.uk/info/about/gene_and_transcript_types.html">Vega descriptions</a>.</p>
	<p>Further details about the annotation of non-coding RNAs are listed on this <a target=_blank href="http://www.ensembl.org/info/genome/genebuild/ncrna.html">Ensembl page</a>.</p>
	<p>Gencode GTF <a href="data_format.html">format description</a>.</p>
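	<p>To relate the biotypes described on this page to an annotation file, here is a minimal sketch that counts the biotype of every <code>gene</code> record in GTF-formatted lines. GENCODE GTF files store the biotype in the <code>gene_type</code> attribute and Ensembl GTF files in <code>gene_biotype</code>; the function names and the example records (with placeholder IDs) are made up for illustration.</p>

```python
import re
from collections import Counter

# Matches quoted GTF attributes such as: gene_type "protein_coding";
# Unquoted numeric attributes (e.g. level 2;) are deliberately ignored.
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(attr_field):
    """Turn the 9th GTF column into a dict (a repeated key keeps its last value)."""
    return dict(ATTR_RE.findall(attr_field))


def count_gene_biotypes(lines):
    """Count biotypes over all records whose feature type is 'gene'."""
    counts = Counter()
    for line in lines:
        if line.startswith("#"):
            continue  # header or comment line
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "gene":
            continue  # malformed line, or a transcript/exon/CDS record
        attrs = parse_attributes(fields[8])
        biotype = attrs.get("gene_type") or attrs.get("gene_biotype") or "unknown"
        counts[biotype] += 1
    return counts


# Hypothetical records for illustration only (IDs are placeholders).
EXAMPLE_LINES = [
    "##description: toy example",
    'chr1\tHAVANA\tgene\t100\t900\t.\t+\t.\tgene_id "G1"; gene_type "protein_coding"; gene_name "X"; level 2;',
    'chr1\tHAVANA\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; gene_type "protein_coding";',
    'chr2\tensembl\tgene\t50\t400\t.\t-\t.\tgene_id "G2"; gene_biotype "lncRNA";',
]

if __name__ == "__main__":
    for biotype, n in count_gene_biotypes(EXAMPLE_LINES).most_common():
        print(biotype, n)
```

	<p>For a real file, pass an open file handle instead of the example list, for instance <code>count_gene_biotypes(open(path))</code> where <code>path</code> points to an uncompressed GTF.</p>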
	
	<hr />
	<h1>Vega gene and transcript types</h1>

      <p>Vega shows annotation from different sources and classifies genes and
	transcripts into different classes. Since a single gene often has more
	than one transcript, and these transcripts can be of different classes,
	the classification of the gene as a whole is defined by the transcript
	with the 'highest' level of classification.</p>      

      <h2>Annotation sources</h2>

      <p>There are three distinct sets of sources for the annotation of genes and transcripts in Vega. These are
	shown as different tracks and have different colour schemes.</p>
      <ul>
	<li><strong>Havana Core Genes</strong> have been annotated in depth to identify
	  alternative transcripts, and are present for all species. They have
	  been annotated by the Havana group at the WTSI.</li>
	<li><strong><a href="http://vega.archive.ensembl.org/info/data/human_lof.html">LoF genes</a></strong>
          show the consequences of variations in sequence on the functional
          properties of human transcripts.</li>
      </ul>

      <p>For each of these sets, genes and transcripts are classified as shown
	below.</p>
      
      <h2 id="genes">Gene Classification</h2>

      <p>Genes can be classified according to their <strong>status</strong>,
	which indicates the type of evidence that supports the annotation, and
	their <strong>biotype</strong>, an indicator of biological
	significance. For simplicity of display, genes are coloured according to
        their biotype only, so for example 'Known Protein coding' and 'Novel Protein
	coding' genes are both shown in the same shade of blue.</p>

      <h3>Status:</h3>
      <ul>
	<li><strong>Known</strong>. Identical to known cDNAs or proteins from the same
	  species and has an entry in species-specific model databases: <a href="http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?CMD=search;DB=gene" rel="external">EntrezGene</a> for human, dog and pig, <a href="http://www.informatics.jax.org/" rel="external">MGI</a> for mouse, <a href="http://rgd.mcw.edu/" rel="external">RGD</a> for rat, and <a href="http://zfin.org/cgi-bin/webdriver?MIval=aa-ZDB_home.apg" rel="external">ZFIN</a> for zebrafish.</li>
	
	<li><strong>Novel</strong>. Identical or homologous to cDNAs from the same
	    species, or proteins from all species.</li>
	<li><strong>Putative</strong>. Identical or homologous to spliced ESTs from
          the same species.</li>
	<li><strong>Predicted</strong>. Based on <i>ab initio</i> prediction and for
          which at least one exon is supported by biological data (unspliced
          ESTs, protein sequence similarity with mouse or tetraodon genomes or
          expression data from Rosetta).</li>
	<li>Genes may have no <strong>status</strong> shown where this is not applicable, as for example
          with the majority of pseudogenes.</li>
      </ul>
      
      <h3>Biotype:</h3>
      <ul>
	<li><strong>Protein coding</strong>. Contains an open reading frame (ORF).
	  <ul>
	    <li><strong>Polymorphic</strong>. A protein coding gene that has at least one transcript with a valid
	      ORF and one or more coding transcripts that contain a polymorphism.</li>
	  </ul></li>

        <li><strong>Processed transcripts</strong>. Doesn't contain an ORF. Divided into three major
          categories.
          <ul>
            <li><strong>Long non-coding RNA (lncRNA)</strong>. Subclassified into one of the following types:
              <ul>
                <li><strong>Non coding</strong>. Contains transcripts which are
                  known from the literature to not be protein coding.</li> 
                <li><strong>3prime_overlapping_ncRNA</strong>. Has transcripts where
                  ditag and/or published experimental data strongly supports the
                  existence of long (&gt;200bp) non-coding transcripts that overlap the
                  3'UTR of a protein-coding locus on the same strand.</li>
                <li><strong>Antisense</strong>. Has transcripts that overlap the
                  genomic span (i.e. exon or introns) of a protein-coding locus on
                  the opposite strand.</li> 
                <li><strong>lincRNA (long intergenic ncRNA)</strong>. Has transcripts from a long
                  intergenic non-coding RNA locus, with a length &gt;200bp. Requires
                  lack of coding potential and may not be conserved between
                  species.</li>
                <li><strong>Retained_intron</strong>. Has an alternatively spliced transcript
                  believed to contain intronic sequence relative to other, coding,
                  variants.</li>
                <li><strong>Sense_intronic</strong>. Has a long non-coding transcript
                  in introns of a coding gene that does not overlap any exons.</li>
                <li><strong>Sense_overlapping</strong>. Has a long non-coding
                  transcript that contains a coding gene in its intron on the
                  same strand.</li>
                <li><strong>Macro_lncRNA</strong>. Unspliced lncRNAs that are
                  several kb in size.</li>
                <li><strong>Bidirectional lncRNA</strong>. A non-coding locus
                  that originates from within the promoter region of a
                  protein-coding gene, with transcription proceeding in the
                  opposite direction on the other strand.</li>
              </ul>
          </li>

            <li><strong>ncRNA</strong>
              <ul>
                <li><strong>miRNA</strong>. microRNA</li>
                <li><strong>piRNA</strong>. piwi-interacting RNA</li>
                <li><strong>rRNA</strong>. ribosomal RNA</li>
                <li><strong>siRNA</strong>. small interfering RNA</li>
                <li><strong>snRNA</strong>. small nuclear RNA</li>
                <li><strong>snoRNA</strong>. small nucleolar RNA</li>
                <li><strong>tRNA</strong>. transfer RNA</li>   
                <li><strong>vaultRNA</strong>. Short non coding RNA genes that
                  form part of the vault ribonucleoprotein complex.</li>
              </ul>
            </li>

            <li><strong>Unclassified processed transcript</strong>. Cannot be
                  placed in one of the other categories.</li>
          </ul>

        </li><li><strong>Pseudogene</strong>. Similar to known proteins but contains a
	  frameshift and/or stop codon(s) that disrupt the ORF. These can be
	  classified into the following:
	  <ul>
	    <li><strong>Processed pseudogene</strong>. Pseudogene that lacks introns and is
		thought to arise from reverse transcription of mRNA followed by
		reinsertion of DNA into the genome.</li>
	    <li><strong>Unprocessed pseudogene</strong>. Pseudogene that can contain
		introns, since it was produced by gene duplication.</li>
	    <li><strong>Transcribed pseudogene</strong>. Pseudogene where protein homology
		or genomic structure indicates a pseudogene, but the presence of
		locus-specific transcripts indicates expression. These
		can be classified into '<strong>Processed</strong>',
		'<strong>Unprocessed</strong>' and '<strong>Unitary</strong>'.</li>
            <li><strong>Translated pseudogene</strong>. Pseudogenes that have
              mass spec data suggesting that they are also translated. These
		can be classified into '<strong>Processed</strong>' and
		'<strong>Unprocessed</strong>'.</li>
	    <li><strong>Polymorphic pseudogene</strong>. Pseudogene owing to a SNP/DIP
		but in other individuals/haplotypes/strains the gene is
		translated.</li>
	    <li><strong>Unitary pseudogene</strong>. A species specific unprocessed pseudogene without a parent
		gene, as it has an active orthologue in another	species.</li>
	    <li><strong>IG Pseudogene</strong>. Inactivated immunoglobulin gene.</li>
	  </ul>
        </li>
	<li><strong>IG Gene</strong>. Immunoglobulin gene.</li>
        <li><strong>TR Gene.</strong> T cell receptor gene.</li>
	<li><strong>TEC (To be Experimentally Confirmed).</strong> This is used for
	    non-spliced EST clusters that have polyA features. This category has
	    been specifically created for the ENCODE project to highlight
	    regions that could indicate the presence of protein coding
	    genes that require experimental validation, either by 5' RACE or
	    RT-PCR to extend the transcripts, or by confirming expression of the
	    putatively-encoded peptide with specific antibodies.</li>
      </ul>
      
      <h2>Transcript Classification</h2>
    <p>Transcripts are classified according to their class.
      </p><ul>
	<li><strong>Protein coding</strong>. Protein coding transcripts. These can be
	    further classified as follows:
	  <ul>
	    <li><strong>Known Protein coding</strong>. 100% Identical to RefSeq NP or Swiss-Prot entry.</li>
	    <li><strong>Novel Protein coding</strong>. Shares &gt;60% length with known
		coding sequence from RefSeq or Swiss-Prot or has cross-species/family support or domain evidence.</li>
	    <li><strong>Putative Protein coding</strong>. Shares &lt;60% length with known
		coding sequence from RefSeq or Swiss-Prot, or has an alternative
              first or last coding exon.</li>
            <li><strong>Nonsense mediated decay</strong>. If the coding sequence
	    (following the appropriate reference) of a transcript finishes &gt;50bp
	    from a downstream splice site then it is tagged as NMD. If the
	    variant does not cover the full reference coding sequence then it
	    is annotated as NMD if NMD is unavoidable i.e. no matter what the
	    exon structure of the missing portion is the transcript will be
	    subject to NMD.</li>
            <li><strong>Nonstop decay</strong>. Transcripts that have polyA
              features (including signal) without a prior stop codon in the CDS,
              i.e. a non-genomic polyA tail attached directly to the CDS without
              3' UTR. These transcripts are subject to degradation.</li>
	  </ul>
	</li>


        <li><strong>Processed transcripts</strong>. Doesn't contain an ORF. Divided into three major
          categories.
          <ul>
            <li><strong>Long non-coding RNA (lncRNA)</strong>. Subclassified into one of the following types:
              <ul>
                <li><strong>Non coding</strong>. Contains transcripts which are
                  known from the literature to not be protein coding.</li> 
                <li><strong>3prime_overlapping_ncRNA</strong>. Has transcripts where
                  ditag and/or published experimental data strongly supports the
                  existence of long (&gt;200bp) non-coding transcripts that overlap the
                  3'UTR of a protein-coding locus on the same strand.</li>
                <li><strong>Antisense</strong>. Has transcripts that overlap the
                  genomic span (i.e. exon or introns) of a protein-coding locus on
                  the opposite strand.</li> 
                <li><strong>lincRNA (long intergenic ncRNA)</strong>. Has transcripts from a long
                  intergenic non-coding RNA locus, with a length &gt;200bp. Requires
                  lack of coding potential and may not be conserved between
                  species.</li>
                <li><strong>Retained_intron</strong>. Has an alternatively spliced transcript
                  believed to contain intronic sequence relative to other, coding,
                  variants.</li>
                <li><strong>Sense_intronic</strong>. Has a long non-coding transcript
                  in introns of a coding gene that does not overlap any exons.</li>
                <li><strong>Sense_overlapping</strong>. Has a long non-coding
                  transcript that contains a coding gene in its intron on the
                  same strand.</li>
                <li><strong>Macro_lncRNA</strong>. Unspliced lncRNAs that are several kb in size.</li> 
              </ul>
          </li>

            <li><strong>ncRNA</strong>
              <ul>
                <li><strong>miRNA</strong>. microRNA</li>
                <li><strong>piRNA</strong>. piwi-interacting RNA</li>
                <li><strong>rRNA</strong>. ribosomal RNA</li>
                <li><strong>siRNA</strong>. small interfering RNA</li>
                <li><strong>snRNA</strong>. small nuclear RNA</li>
                <li><strong>snoRNA</strong>. small nucleolar RNA</li>
                <li><strong>tRNA</strong>. transfer RNA</li>   
                <li><strong>vaultRNA</strong>. Short non coding RNA genes that
                  form part of the vault ribonucleoprotein complex.</li>
              </ul>
            </li>

            <li><strong>Unclassified processed transcript</strong>. Cannot be
                  placed in one of the other categories.</li>
          </ul>

	</li><li><strong>Pseudogene</strong>. Have homology to proteins but generally
	    suffer from a disrupted coding sequence and	an active homologous gene can be
	    found at another locus. Sometimes these entries have an intact
	    coding sequence	or an open but truncated ORF, in which case there is other evidence
	    used (for example genomic polyA stretches at the 3' end) to classify
	    them as a pseudogene. Can be further classified as follows:
	  <ul>
	    <li><strong>Processed pseudogene</strong>. Pseudogene that appears to have
		been produced by integration of a reverse transcribed mRNA into
		the genome.</li> 
	    <li><strong>Unprocessed pseudogene</strong>. Pseudogene that shows evidence
		of loss of function, but has exon-intron structure.</li>
	    <li><strong>Transcribed pseudogene</strong>. Pseudogene where protein homology
		or genomic structure indicates a pseudogene, but the presence of
		locus-specific transcripts indicates expression. These
		can be classified into '<strong>Processed</strong>' and
		'<strong>Unprocessed</strong>'.</li> 
            <li><strong>Translated pseudogene</strong>. Pseudogenes that have
              mass spec data suggesting that they are also translated. These
		can be classified into '<strong>Processed</strong>' and
		'<strong>Unprocessed</strong>'.</li>
	    <li><strong>Polymorphic pseudogene</strong>. Pseudogene owing to a SNP/DIP
		but in other individuals/haplotypes/strains the gene is
		translated.</li>
	    <li><strong>Unitary pseudogene</strong>. A species specific unprocessed pseudogene without a parent
		gene, as it has an active orthologue in another	species.</li>
	    <li><strong>IG Pseudogene</strong>. Inactivated immunoglobulin gene.</li>
	  </ul>
	</li>       
	<li><strong>TEC (To be Experimentally Confirmed)</strong>. This is used for
	    non-spliced EST clusters that have polyA features. This category has
	    been specifically created for the ENCODE project to highlight
	    regions that could indicate the presence of novel protein coding
	    genes that require experimental validation, either by 5' RACE or
	    RT-PCR to extend the transcripts, or by confirming expression of the
	    putatively-encoded peptide with specific antibodies.</li>
	<li><strong>Artifact</strong>. Used to tag mistakes in the public databases
	    (Ensembl/SwissProt/TrEMBL). Usually these arise from
	    high-throughput cDNA sequencing projects, which submit automatic
	    annotation sometimes resulting in erroneous coding sequences that
	    are, for example, 3' UTRs.</li>
	<li><strong>IG Gene</strong>. Immunoglobulin gene.</li>
        <li><strong>TR Gene</strong>. T cell receptor gene.</li>
      </ul>
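      <p>The "Nonsense mediated decay" rule in the transcript list above (coding sequence finishing &gt;50bp from a downstream splice site) can be sketched as follows. This is only an illustration under stated assumptions: positions are in spliced transcript coordinates, the function and parameter names are hypothetical, and only the full-length CDS case is covered, not the partial-CDS case where NMD must be "unavoidable".</p>

```python
from itertools import accumulate


def is_nmd_candidate(exon_lengths, cds_end, threshold=50):
    """Apply the 50 bp rule in transcript (spliced) coordinates.

    exon_lengths: exon lengths in 5'->3' transcript order.
    cds_end: 1-based transcript position of the last CDS base
             (stop codon included).
    Returns True if any exon-exon junction lies more than `threshold`
    bases downstream of the end of the coding sequence.
    """
    if not exon_lengths:
        raise ValueError("at least one exon is required")
    # Cumulative sums give the last base of each exon; every exon end
    # except the final one is a splice junction.
    junctions = list(accumulate(exon_lengths))[:-1]
    return any(junction - cds_end > threshold for junction in junctions)


if __name__ == "__main__":
    exons = [200, 150, 300]              # made-up exon lengths; junctions at 200 and 350
    print(is_nmd_candidate(exons, 250))  # stop 100 bases upstream of the last junction
    print(is_nmd_candidate(exons, 320))  # stop 30 bases upstream of the last junction
    print(is_nmd_candidate(exons, 500))  # stop inside the last exon
```

      <p>A stop codon in the last exon, or within 50 bases of the last junction, leaves no qualifying downstream splice site, so such a transcript is not tagged by this rule.</p>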
      
    </div>
	
	













<pre>
2019.9.23 Added the gene classification section
</pre>


</div>
